AITopics | character level

character level

EXECUTE: A Multilingual Benchmark for LLM Token Understanding

Edman, Lukas, Schmid, Helmut, Fraser, Alexander

arXiv.org Artificial Intelligence, May-26-2025

The CUTE benchmark showed that LLMs struggle with character understanding in English. We extend it to more languages with diverse scripts and writing systems, introducing EXECUTE. Our simplified framework allows easy expansion to any language. Tests across multiple LLMs reveal that challenges in other languages are not always on the character level as in English. Some languages show word-level processing issues, some show no issues at all. We also examine sub-character tasks in Chinese, Japanese, and Korean to assess LLMs' understanding of character components.

benchmark, large language model, natural language, (18 more...)

arXiv:2505.17784

Country: North America > United States (0.46)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Computational Analysis of Gender Depiction in the Comedias of Calderón de la Barca

Keith, Allison, Castro, Antonio Rojas, Padó, Sebastian

arXiv.org Artificial Intelligence, Nov-6-2024

In Spain, the Baroque period was a time of immense artistic creativity, generally known as the "Golden Age" (siglo de oro). This is particularly true in literature, where the period saw exceptional writers such as Lope de Vega, Tirso de Molina or Pedro Calderón de la Barca. The latter, who lived from 1600 to 1681, is generally considered one of the most important playwrights of the age. He was immensely productive, writing a total of over 200 theatrical plays, both secular and religious, which had a lasting impact on Spanish theatre and beyond [17]. He is particularly known for detailed and complex characterizations in his works [46]. Not surprisingly, Calderón's writings have been subject to intense analysis by literary scholars over a long period of time, and topics have moved in and out of fashion. For example, traditional foci of scholarship have been the role of honor and power in the works [19] or Calderón's attention to dramatic structure [43]. A relatively new aspect among these is gender depiction, that is, the question of how Calderón conceptualized male and female roles in his plays differently, which has gained global attention in Hispanic Studies since the latter half of the 20th century [2, 32, 39].

computational linguistic, gender, prediction, (13 more...)

arXiv:2411.03895

Country:

Europe > Spain (0.34)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.28)
North America > United States > Nevada (0.04)
(12 more...)

Genre: Research Report > New Finding (0.68)

Industry: Education (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.46)

Mind Scramble: Unveiling Large Language Model Psychology Via Typoglycemia

Yu, Miao, Mao, Junyuan, Zhang, Guibin, Ye, Jingheng, Fang, Junfeng, Zhong, Aoxiao, Liu, Yang, Liang, Yuxuan, Wang, Kun, Wen, Qingsong

arXiv.org Artificial Intelligence, Oct-23-2024

Research into the external behaviors and internal mechanisms of large language models (LLMs) has shown promise in addressing complex tasks in the physical world. Studies suggest that powerful LLMs, like GPT-4, are beginning to exhibit human-like cognitive abilities, including planning, reasoning, and reflection. In this paper, we introduce a research line and methodology called LLM Psychology, leveraging human psychology experiments to investigate the cognitive behaviors and mechanisms of LLMs. We migrate the Typoglycemia phenomenon from psychology to explore the "mind" of LLMs. Unlike human brains, which rely on context and word patterns to comprehend scrambled text, LLMs use distinct encoding and decoding processes. Through Typoglycemia experiments at the character, word, and sentence levels, we observe: (I) LLMs demonstrate human-like behaviors on a macro scale, such as lower task accuracy and higher token/time consumption; (II) LLMs exhibit varying robustness to scrambled input, making Typoglycemia a benchmark for model evaluation without new datasets; (III) Different task types have varying impacts, with complex logical tasks (e.g., math) being more challenging in scrambled form; (IV) Each LLM has a unique and consistent "cognitive pattern" across tasks, revealing general mechanisms in its psychology process. We provide an in-depth analysis of hidden layers to explain these phenomena, paving the way for future research in LLM Psychology and deeper interpretability.
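
The character-level scrambling mentioned above can be illustrated with a minimal sketch. It assumes the common Typoglycemia convention of shuffling only the interior letters of each word while keeping the first and last letters fixed; the function names and the whitespace-based word splitting are illustrative assumptions, not the authors' implementation.

```python
import random


def scramble_word(word: str, rng: random.Random) -> str:
    """Shuffle the interior letters of a word; first and last stay in place."""
    if len(word) <= 3:
        # Words of up to three characters have at most one interior letter.
        return word
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]


def scramble_text(text: str, seed: int = 0) -> str:
    """Scramble every whitespace-separated word (punctuation is not handled)."""
    rng = random.Random(seed)  # fixed seed keeps the scrambling reproducible
    return " ".join(scramble_word(w, rng) for w in text.split())


if __name__ == "__main__":
    print(scramble_text("language models read scrambled text"))
```

Because only letter order changes, each scrambled word keeps its length and its multiset of characters, which makes it easy to pair an original prompt with its scrambled counterpart when comparing model accuracy.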

large language model, machine learning, natural language, (20 more...)

arXiv:2410.01677

Country:

North America > United States (0.46)
Asia (0.28)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Energy > Oil & Gas (1.00)
Leisure & Entertainment > Sports > Football (0.92)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)

CUTE: Measuring LLMs' Understanding of Their Tokens

Edman, Lukas, Schmid, Helmut, Fraser, Alexander

arXiv.org Artificial Intelligence, Oct-2-2024

Large Language Models (LLMs) show remarkable performance on a wide variety of tasks. Most LLMs split text into multi-character tokens and process them as atomic units without direct access to individual characters. This raises the question: To what extent can LLMs learn orthographic information? To answer this, we propose a new benchmark, CUTE, which features a collection of tasks designed to test the orthographic knowledge of LLMs. We evaluate popular LLMs on CUTE, finding that most of them seem to know the spelling of their tokens, yet fail to use this information effectively to manipulate text, calling into question how much of this knowledge is generalizable.
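
To make the distinction between knowing a spelling and manipulating text concrete, here is a small sketch of gold-answer functions for three character-level tasks. The task selection and the function names are hypothetical illustrations, not the benchmark's own definitions or prompts.

```python
def spell(word: str) -> str:
    """Gold answer for a spelling task: characters separated by spaces."""
    return " ".join(word)


def contains_char(word: str, char: str) -> bool:
    """Gold answer for a membership task: does the word contain the character?"""
    return char in word


def delete_char(word: str, char: str) -> str:
    """Gold answer for a manipulation task: remove every occurrence of char."""
    return word.replace(char, "")


if __name__ == "__main__":
    print(spell("token"))              # t o k e n
    print(contains_char("token", "k"))  # True
    print(delete_char("letter", "t"))   # leer
```

Each function is trivial for a program with direct character access; a model that sees "token" as one or two opaque token IDs has to recover the same answer from what it learned about those IDs.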

benchmark, computational linguistic, llm, (16 more...)

arXiv:2409.15452

Country:

Europe > Germany > Bavaria > Upper Bavaria > Munich (0.05)
North America > Canada > Ontario > Toronto (0.04)
North America > United States > Washington > King County > Seattle (0.04)
(2 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.53)

Large Language Models Lack Understanding of Character Composition of Words

Shin, Andrew, Kaneko, Kunitake

arXiv.org Artificial Intelligence, Jun-19-2024

Large language models (LLMs) have demonstrated remarkable performances on a wide range of natural language tasks. Yet, LLMs' successes have been largely restricted to tasks concerning words, sentences, or documents, and it remains questionable how much they understand the minimal units of text, namely characters. In this paper, we examine contemporary LLMs regarding their ability to understand character composition of words, and show that most of them fail to reliably carry out even the simple tasks that can be handled by humans with perfection. We analyze their behaviors with comparison to token level performances, and discuss the potential directions for future research.

character composition, language model, llm, (14 more...)

arXiv:2405.11357

Country:

Africa > Middle East > Egypt > Giza Governorate > Giza (0.04)
North America > United States > Texas > Travis County > Austin (0.04)
Europe > Monaco (0.04)
(3 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

BSpell: A CNN-Blended BERT Based Bangla Spell Checker

Rahman, Chowdhury Rafeed, Rahman, MD. Hasibur, Zakir, Samiha, Rafsan, Mohammad, Ali, Mohammed Eunus

arXiv.org Artificial Intelligence, Dec-31-2023

Bangla typing is mostly performed using an English keyboard and can be highly erroneous due to the presence of compound and similarly pronounced letters. Spelling correction of a misspelled word requires understanding of the word's typing pattern as well as the context of its usage. This paper proposes a specialized BERT model named BSpell, targeted at word-for-word correction at the sentence level. BSpell contains an end-to-end trainable CNN sub-model named SemanticNet along with a specialized auxiliary loss. This allows BSpell to specialize in highly inflected Bangla vocabulary in the presence of spelling errors. Furthermore, a hybrid pretraining scheme is proposed for BSpell that combines word-level and character-level masking. Comparison on two Bangla and one Hindi spelling correction datasets shows the superiority of our proposed approach. BSpell is available as a Bangla spell checking tool via GitHub: https://github.com/Hasiburshanto/Bangla-Spell-Checker

bspell, dataset, representation, (15 more...)

doi: 10.18653/v1/2023.banglalp-1.2

arXiv:2208.09709

Country:

Oceania > Australia > Victoria > Melbourne (0.04)
North America > United States > Texas (0.04)
Europe > Italy > Tuscany > Florence (0.04)
(2 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.96)

Improving Scene Text Recognition for Character-Level Long-Tailed Distribution

Park, Sunghyun, Chung, Sunghyo, Lee, Jungsoo, Choo, Jaegul

arXiv.org Artificial Intelligence, Mar-31-2023

Despite the recent remarkable improvements in scene text recognition (STR), the majority of studies have focused mainly on the English language, which includes only a small number of characters. However, STR models show a large performance degradation on languages with a large number of characters (e.g., Chinese and Korean), especially on characters that rarely appear due to the long-tailed distribution of characters in such languages. To address this issue, we conducted an empirical analysis using synthetic datasets with different character-level distributions (e.g., balanced and long-tailed distributions). While substantially increasing the number of tail-class samples without considering the context helps the model correctly recognize characters individually, training with such a synthetic dataset interferes with the model's learning of contextual information (i.e., the relation among characters), which is also important for predicting the whole word. Based on this motivation, we propose a novel Context-Aware and Free Experts Network (CAFE-Net) using two experts: 1) a context-aware expert that learns the contextual representation, trained with a long-tailed dataset composed of common words used in everyday life, and 2) a context-free expert that focuses on correctly predicting individual characters by utilizing a dataset with a balanced number of characters. By training the two experts to focus on learning contextual and visual representations, respectively, we propose a novel confidence ensemble method to compensate for the limitations of each expert. Through the experiments, we demonstrate that CAFE-Net improves STR performance on languages containing a large number of characters. Moreover, we show that CAFE-Net is easily applicable to various STR models.

machine learning, natural language, pattern recognition, (20 more...)

arXiv:2304.08592

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Europe > Switzerland > Vaud > Lausanne (0.04)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Text Recognition (0.62)

Discriminating Between Similar Nordic Languages

Haas, René, Derczynski, Leon

arXiv.org Artificial Intelligence, Mar-23-2023

Automatic language identification is a challenging problem. Discriminating between closely related languages is especially difficult. This paper presents a machine learning approach for automatic language identification for the Nordic languages, which often suffer miscategorisation by existing state-of-the-art tools. Concretely we will focus on discrimination between six Nordic languages: Danish, Swedish, Norwegian (Nynorsk), Norwegian (Bokmål), Faroese and Icelandic.

artificial intelligence, machine learning, tatoeba data, (18 more...)

arXiv:2012.06431

Country:

Europe > Denmark > Capital Region > Copenhagen (0.04)
South America > Brazil (0.04)
North America > United States (0.04)
(2 more...)

Genre: Research Report (0.66)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.32)

How Robust is Neural Machine Translation to Language Imbalance in Multilingual Tokenizer Training?

Zhang, Shiyue, Chaudhary, Vishrav, Goyal, Naman, Cross, James, Wenzek, Guillaume, Bansal, Mohit, Guzman, Francisco

arXiv.org Artificial Intelligence, Sep-10-2022

A multilingual tokenizer is a fundamental component of multilingual neural machine translation. It is trained from a multilingual corpus. Since a skewed data distribution is considered to be harmful, a sampling strategy is usually used to balance languages in the corpus. However, few works have systematically answered how language imbalance in tokenizer training affects downstream performance. In this work, we analyze how translation performance changes as the data ratios among languages vary in the tokenizer training corpus. We find that while relatively better performance is often observed when languages are more equally sampled, the downstream performance is more robust to language imbalance than we usually expected. Two features, UNK rate and closeness to the character level, can warn of poor downstream performance before performing the task. We also distinguish language sampling for tokenizer training from sampling for model training and show that the model is more sensitive to the latter.
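
The abstract names two warning signals, UNK rate and closeness to the character level, without defining them here. A minimal sketch follows, assuming UNK rate means the fraction of output tokens equal to the unknown token, and using average characters per token as a stand-in for closeness to the character level (a value near 1 suggests near-character segmentation). Both function names and the `<unk>` string are illustrative assumptions, not the paper's definitions.

```python
from typing import Sequence


def unk_rate(tokens: Sequence[str], unk_token: str = "<unk>") -> float:
    """Fraction of tokens that the tokenizer mapped to the unknown token."""
    if not tokens:
        return 0.0
    return sum(t == unk_token for t in tokens) / len(tokens)


def chars_per_token(tokens: Sequence[str], unk_token: str = "<unk>") -> float:
    """Average length of known tokens; values near 1.0 mean the text is
    being split almost character by character."""
    known = [t for t in tokens if t != unk_token]
    if not known:
        return 0.0
    return sum(len(t) for t in known) / len(known)


if __name__ == "__main__":
    tokens = ["trans", "lation", "<unk>", "s", "y", "s", "t", "e", "m"]
    print(unk_rate(tokens))
    print(chars_per_token(tokens))
```

Both quantities can be computed on held-out text per language right after tokenizer training, which is what makes them usable as early checks before any translation model is trained.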

character level, experiment, unk rate, (15 more...)

arXiv:2204.14268

Country:

North America > United States > California > San Diego County > San Diego (0.04)
North America > Dominican Republic (0.04)
Europe > Portugal > Lisbon > Lisbon (0.04)
(2 more...)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Back Translation in Text Augmentation by nlpaug

#artificialintelligence, Aug-29-2020, 07:35:11 GMT

English is one of the languages with plentiful training data for translation, while some languages may not have enough data to train a machine translation model. Sennrich et al. used the back-translation method to generate more training data and improve translation model performance. Suppose we want to train a model that translates English (source language) to Cantonese (target language), and there is not enough training data for Cantonese. Back-translation translates target-language text into the source language and mixes the original source sentences with the back-translated sentences to train a model. In this way, the amount of training data from the source language to the target language can be increased.
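
The data flow can be sketched as follows. This is a minimal illustration that assumes monolingual target-language sentences are translated back into the source language to form synthetic training pairs, which are then mixed with the real parallel data. The function `augment_with_back_translation`, the toy dictionary, and `toy_translate` are hypothetical stand-ins; this is not nlpaug's API, and a real setup would plug in a trained target-to-source translation model.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def augment_with_back_translation(
    parallel: Iterable[Pair],
    monolingual_target: Iterable[str],
    target_to_source: Callable[[str], str],
) -> List[Pair]:
    """Mix real pairs with synthetic pairs built from target-only text."""
    # Each synthetic pair has a machine-generated source side and a
    # human-written target side.
    synthetic = [(target_to_source(t), t) for t in monolingual_target]
    return list(parallel) + synthetic


# Toy stand-in for a trained Cantonese-to-English model (hypothetical).
toy_dictionary = {"你好": "hello", "多謝": "thank you"}


def toy_translate(sentence: str) -> str:
    return toy_dictionary.get(sentence, "<unk>")


training_pairs = augment_with_back_translation(
    parallel=[("good morning", "早晨")],
    monolingual_target=["你好", "多謝"],
    target_to_source=toy_translate,
)

if __name__ == "__main__":
    for source, target in training_pairs:
        print(source, "->", target)
```

Passing the translator in as a callable keeps the mixing logic independent of whichever translation model or library produces the back-translated sentences.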

machine learning, natural language, training data, (16 more...)

Country: Europe > Germany (0.05)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.97)